Abstract

The problem of representing text documents within an Infor- 
mation Retrieval system is formulated as an analogy to the 
problem of representing the quantum states of a physical sys- 
tem. Lexical measurements of text are proposed as a way of 
representing documents which are akin to physical measure- 
ments on quantum states. Consequently, the representation of 
the text is only known after measurements have been made, 
and because the process of measuring may destroy parts of 
the text, the document is characterised through erasure. The 
mathematical foundations of such a quantum representation 
of text are provided in this position paper as a starting point 
for indexing and retrieval within a "quantum like" Informa- 
tion Retrieval system. 

Introduction 

The problem of indexing, i.e. generating compact and in- 
formative representations of documents, is an important is- 
sue in Information Retrieval (IR). For text documents, the 
most successful representations have been based on the
occurrence of terms in documents: either their presence or
absence, or some statistical information about each term's
occurrence in the document. Consequently, a document is
represented as an array of terms, and is assumed to be fixed
or static in nature. These representations are used in
standard IR models such as the Boolean model, the Binary
Independence Model (BIM), the Vector Space Model, the Language
Model, etc. (van Rijsbergen 1979; Ponte & Croft 1998;
Salton & Lesk 1968), where the representation employed
tends to be dictated by the model. For example, both the 
Boolean model and BIM expect a binary representation, 
whereas the Language Model expects a probability distri- 
bution over the vocabulary. 

In this work, a different approach is taken, where instead 
of focusing directly on building an IR model, the focus is 
put on devising an underlying representation of documents, 
which is inspired by Quantum Theory (QT). Such a repre- 
sentation should be suitable for being used by an IR system. 

An important part of physics deals with the problem of
representing, in the state of a system, the information an
observer can obtain from a set of measurements. QT provides
a solution in which measurements on a quantum system
provide a representation of the state of the system. This
theory belongs to the science of natural objects (e.g. photons,
electrons, etc.). However, IR is a science of artificial
objects (e.g. texts and documents) (van Rijsbergen 2004).
Consequently, it is necessary to explain how QT can
be applied in the context of IR.

Documents can be thought of as states of a physical system,
and their features (such as terms) can be viewed as
physical observables to be measured in such a system. If a
suitable definition of the measurements to be performed on 
documents is used, then the powerful theoretical machinery 
of QT can be engaged to represent and use the information 
obtained. The main contribution of this position paper is 
to define suitable lexical measurements which can be per- 
formed on text which will form the basis for a document 
representation scheme. 

Historically, the most successful methods for automatic 
indexing of text documents have been based mainly on 
the statistical analysis of the occurrence of terms in documents
(Sparck Jones 2003). It is reasonable, therefore, to
propose measurements which are based on the features re- 
lated to the frequency of occurrence of terms in text docu- 
ments. These will be referred to as lexical measurements. 

In the next section, lexical measurements on text docu- 
ments are proposed and defined, and it is shown how these 
measurements reflect the properties of ideal quantum mea- 
surements. Then, operations between the measurements are 
defined, which enable different relationships to be captured. 
The proposed measurements are then discussed and direc- 
tions for further work outlined. 

Lexical measurements on Textual Documents 

In a physical system, the state of the system is defined by 
the probabilities of the possible outcomes of measurements 
performed on that system. However, the state of a quantum 
system can only have some of the measurement outcomes 
determined, not all of them. For example, there is an im- 
possibility of determining both position and velocity of an 
electron (Heisenberg indeterminacy principle): only one of 
the two properties can be determined with certainty, while 
the other becomes uncertain when the first is determined. 

For some pairs of measurements, the value of the
corresponding observables will not depend on the order in which
the measurements are performed. In that case, the
measurements are compatible. Other measurements, however,
interfere with each other, in such a way that the obtained
outcomes depend on the order in which the measurements are
performed. These are incompatible measurements.

Figure 1: The Selective Eraser
The big white box represents a document, the gray areas represent
the chosen window, and the dark gray squares represent the
occurrences of the chosen term in the middle of each window.

A maximal set of compatible measurements can be per- 
formed in order to determine the state to a maximum extent, 
but measurements that are incompatible will not have their 
outcomes determined. This maximal set chosen by the
experimenter can be thought of as an experimental context, and
the system will have less information about the outcomes 
of other sets of measurements that are incompatible to the 
chosen ones. 

The problem of using information from measurements to 
represent the state of a system can be formulated as a very 
general representation problem. It is possible to adopt the 
view of lexical measurements as physical measurements on 
a system, and use a sophisticated representation scheme bor- 
rowed from physics. The involved measurements must then 
be defined in such a way that they share the properties
of physical measurements described above. The proposed
lexical measurements are based on measuring the
co-occurrence of terms within documents, so that they act like
measurements on a quantum system. Counting is then viewed as a
projective operation on text documents, via certain
transformations of the document that are defined in the next section.

Selective Erasers 

The proposed approach is based on the definition of certain 
transformations that can be applied to text documents. These 
transformations will be called Selective Erasers and are
denoted by E(t,w), where t is a chosen central term, and w
is the number of preserved terms on either side of each
occurrence of t. Applying a Selective Eraser amounts to
erasing every term in the document that does not fall within a
window of text (a sequence of terms in the document) centred on
an occurrence of t, which includes w tokens to the left and w
tokens to the right (see figure 1). Thus, the total size of
the window is 2w + 1 tokens. The transformation E(t,w)
converts document D into a document D' with some erased
tokens, such that E(t,w)D = D'.
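
The definition of E(t,w) can be illustrated with a minimal Python sketch. The function names (`tokenize`, `selective_eraser`, `surviving`), the lowercase word tokenisation, and the choice to mark erased tokens with `None` (so that positions in the document are preserved) are assumptions made for illustration only; the paper does not prescribe an implementation.

```python
import re
from typing import List, Optional

def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens (punctuation is dropped)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def selective_eraser(tokens: List[Optional[str]], term: str, w: int) -> List[Optional[str]]:
    """Apply E(term, w): keep only tokens within w positions of an occurrence of term.

    Assumption: erased tokens are replaced by None, so token positions are preserved.
    """
    centres = [i for i, tok in enumerate(tokens) if tok == term]
    keep = set()
    for c in centres:
        # Window of 2w + 1 positions, clipped to the document boundaries.
        keep.update(range(max(0, c - w), min(len(tokens), c + w + 1)))
    return [tok if i in keep else None for i, tok in enumerate(tokens)]

def surviving(tokens: List[Optional[str]]) -> List[str]:
    """Return the tokens that have not been erased."""
    return [tok for tok in tokens if tok is not None]

D = tokenize("to be or not to be, that is the question")
D_prime = selective_eraser(D, "is", 2)
print(" ".join(surviving(D_prime)))  # be that is the question
```

Keeping erased positions as `None` rather than deleting them is one possible reading of "erasure"; it keeps window distances defined relative to the original document.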

In the following subsection, the "quantum like" properties 
of erasers are described along with the operations that can 
be performed on them, which extend the possibility of using
them beyond simple co-occurrence measures, and form the
basis of a "quantum like" IR system.

Properties of Selective Erasers 

According to Beltrametti & Cassinelli (1981), ideal quantum
measurements need to satisfy three important properties:
(1) idempotency (projection postulate), (2) an ordered
structure, and (3) the possibility of being non-commutative.
Given the definition of the Selective Eraser, each property is
fulfilled as described below:

1. They are idempotent: applying them any number of times
is the same as applying them once. For example, let
document D = "to be or not to be, that is the question". If
we apply E(is,2) to D, we are left with D' = "be, that is
the question". If we apply it again, it will not perform
further deletions, because all remaining terms are within the
window already: E(is,2)D = E(is,2)[E(is,2)D].

2. They have order relations (see figure 2). If applying E_1
and then E_2 gives the same result as applying E_1 alone, then
we can say E_2 ≥ E_1. With the same example, we can compare
E(is,2) and E(is,3). If a term is erased by E(is,3) it will
also be erased by E(is,2), but not necessarily the other
way around. In our example with D, both would erase "to
be or not", but only E(is,2) would erase the second "to".
So, we can say that E(is,3) ≥ E(is,2), because E(is,3)
will always leave unchanged the same terms as E(is,2),
and possibly other terms. The mathematical definition of
the order relation is:

\[ E_1 \geq E_2 \iff \forall D_i : E_1 E_2 D_i = E_2 D_i \tag{1} \]

When two erasers do not have an order relation, we can say they
are incompatible, and represent that relation with the symbol
\not\gtreqless:

\[ E_1 \not\gtreqless E_2 \iff \neg(E_1 \geq E_2) \wedge \neg(E_1 \leq E_2) \tag{2} \]

Figure 2: Order relations between compatible erasers 
Here the lighter gray areas represent one eraser, and the dark areas
another. These two erasers are said to be compatible because the 
result is the same in any order: they commute. They also show an 
order relation: one of them includes the other because it preserves 
the same parts of the document, plus others. 

3. They do not always commute. When some occurrence of the
central term of one eraser is amongst the terms erased by the
other, applying the two erasers E_1 and E_2 in a different
order can produce a different result (see figure 3).



This is similar to the situation we find with measurements 
in QT: there are particle-like properties, such as posi- 
tion, that are incompatible with wave-like properties, such 
as wavelength (closely related to velocity). Measuring 
a particle-like property will always erase part of the in- 
formation about wave-like properties, and the other way 
around, so the result is different when making the two 
measurements in two different orders. 
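
The three properties above can be checked on the running example with a short Python sketch. The helper names `erase` and `text` are hypothetical, and the sketch assumes that erased tokens are kept as `None` placeholders so that positions do not shift. The erasers E(is,1) and E(that,0) used for the third property are chosen here for illustration and do not appear in the text.

```python
from typing import List, Optional

Doc = List[Optional[str]]

def erase(doc: Doc, term: str, w: int) -> Doc:
    """Apply E(term, w); erased tokens become None so positions are kept (assumption)."""
    centres = [i for i, tok in enumerate(doc) if tok == term]
    keep = {j for c in centres for j in range(c - w, c + w + 1)}
    return [tok if i in keep else None for i, tok in enumerate(doc)]

def text(doc: Doc) -> str:
    """Join the surviving tokens into a string."""
    return " ".join(tok for tok in doc if tok is not None)

D = "to be or not to be that is the question".split()

# (1) Idempotency: applying E(is,2) twice equals applying it once.
once = erase(D, "is", 2)
twice = erase(once, "is", 2)
print(once == twice)  # True

# (2) Order: E(is,3) >= E(is,2), so combining them in either order gives E(is,2)D.
small = erase(D, "is", 2)
print(erase(erase(D, "is", 3), "is", 2) == small)  # True
print(erase(small, "is", 3) == small)              # True

# (3) Non-commutativity: E(that,0) erases the only occurrence of "is",
# so the order of application matters.
a = erase(erase(D, "is", 1), "that", 0)  # E(that,0)[E(is,1)D]
b = erase(erase(D, "that", 0), "is", 1)  # E(is,1)[E(that,0)D]
print(repr(text(a)), repr(text(b)))  # 'that' ''
```

These checks are made on a single document, whereas the order relation in the text is defined over all documents; the sketch only illustrates the behaviour, it does not prove it.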

Operations with Selective Erasers 

An eraser can be thought of as a selection of terms fulfilling
a certain proposition, like "the term is at most w terms apart
from an occurrence of term t". If the proposition is false, the
term is erased, while if the proposition is true, the term is
preserved. Moreover, we can define composite transformations
made with erasers in a number of ways. They can be
regarded as propositions themselves, and it is possible to
operate on them using the usual logical operations, like
"not" (¬), "or" (∨) and "and" (∧). Three such composite
transformations that can be defined are:

1. The complement ¬E erases every term that is not erased
by E from the document.

2. The join E_1 ∨ E_2 erases only the terms that would be erased
by both of the erasers.

3. The meet E_1 ∧ E_2 erases the terms that would have been
erased by either of the erasers.



Figure 3: Operations between erasers 
The big white boxes are the same document, and the dark squares
are the occurrences of the chosen terms. The gray areas are the 
parts of the document preserved for each transformation. 

It is easy to verify that some order relations always hold 
for these composite transformations: 

\[ (E_1 \vee E_2) \geq E_1 \geq (E_1 \wedge E_2) \tag{3} \]
\[ (\neg E_1 \geq \neg E_2) \iff (E_1 \leq E_2) \tag{4} \]
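
The composite transformations and relations (3) and (4) can be sketched in Python by treating each eraser, for one fixed document, as a boolean "keep" mask over token positions. This mask representation and the names `keep_mask`, `complement`, `join`, `meet` and `geq` are assumptions for illustration; the check is made on one example document only, whereas the order relation in the text is defined over all documents.

```python
from typing import List

Mask = List[bool]

def keep_mask(tokens: List[str], term: str, w: int) -> Mask:
    """True at every position that E(term, w) would preserve in this document."""
    centres = [i for i, tok in enumerate(tokens) if tok == term]
    return [any(abs(i - c) <= w for c in centres) for i in range(len(tokens))]

def complement(m: Mask) -> Mask:
    """Keep exactly the tokens the original eraser erases."""
    return [not k for k in m]

def join(m1: Mask, m2: Mask) -> Mask:
    """Keep a token if either eraser keeps it (erase only what both erase)."""
    return [a or b for a, b in zip(m1, m2)]

def meet(m1: Mask, m2: Mask) -> Mask:
    """Keep a token only if both erasers keep it (erase what either erases)."""
    return [a and b for a, b in zip(m1, m2)]

def geq(m1: Mask, m2: Mask) -> bool:
    """m1 >= m2 on this document: m1 preserves every token that m2 preserves."""
    return all(a or not b for a, b in zip(m1, m2))

D = "to be or not to be that is the question".split()
E1 = keep_mask(D, "not", 2)
E2 = keep_mask(D, "that", 2)

# Relation (3): (E1 v E2) >= E1 >= (E1 ^ E2)
rel3 = geq(join(E1, E2), E1) and geq(E1, meet(E1, E2))

# Relation (4): (not A >= not B) <=> (A <= B), checked for A = E(is,2), B = E(is,3)
A = keep_mask(D, "is", 2)
B = keep_mask(D, "is", 3)
rel4 = geq(complement(A), complement(B)) == geq(B, A)

print(rel3, rel4)  # True True
```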

Some order relations arise from the logical characteristics
of the propositions that define the transformations. Let
proposition P_1 define transformation T_1 and proposition P_2
define transformation T_2. If P_1 implies P_2 (P_1 ⇒ P_2), then
we can infer an order relation between the transformations
defined by those propositions: T_1 ≤ T_2. As all the terms
fulfilling P_1 also fulfil P_2, T_2 will preserve the same terms
as transformation T_1, or more.

Other order relations between the transformations are not
determined by the logical structure of the propositions, but
are contingent on the choice of documents. They will hold for
some documents, but not for others.

The simplest Selective Erasers are those which erase
everything but the occurrences of a term. According to the
definition, they are written E(t,0). They will be
represented by 1-dimensional projectors. If such Selective
Erasers are defined for each term in the vocabulary, then the
projectors will be orthogonal to one another, because after
applying one of them to the document, applying another
will erase the remainder:

\[ E(t_1,0)\,E(t_2,0) = 0 \iff t_1 \neq t_2 \tag{5} \]

The application of an eraser E(t_i,0) on D will produce
a transformed document containing only the occurrences of
t_i. Using a counting operation on the transformed document,
the number of times t_i occurs can be obtained. If an eraser
for each term is applied independently on D, then the term
frequency of each term can be obtained. This results in
a standard bag-of-words representation of D. For
instance, N(t_i,D) = |E(t_i,0)D|, where N(t_i,D) is the number of
times t_i occurs in D, and |.| is the counting operation, which
returns the number of tokens in the transformed document.
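
As a rough sketch of N(t_i,D) = |E(t_i,0)D|, the following Python code builds a bag-of-words representation by applying a zero-width eraser for every term in the vocabulary and counting the surviving tokens. The helper names `erase` and `count` are hypothetical, and erased tokens are assumed to be kept as `None` placeholders.

```python
from collections import Counter
from typing import Dict, List, Optional

Doc = List[Optional[str]]

def erase(doc: Doc, term: str, w: int) -> Doc:
    """Apply E(term, w); erased tokens become None (assumption)."""
    centres = [i for i, tok in enumerate(doc) if tok == term]
    keep = {j for c in centres for j in range(c - w, c + w + 1)}
    return [tok if i in keep else None for i, tok in enumerate(doc)]

def count(doc: Doc) -> int:
    """The counting operation |.|: number of tokens left unerased."""
    return sum(tok is not None for tok in doc)

D = "to be or not to be that is the question".split()

# N(t, D) = |E(t, 0) D| for every term t in the vocabulary of D.
bag: Dict[str, int] = {t: count(erase(D, t, 0)) for t in sorted(set(D))}
print(bag["to"], bag["be"], bag["is"])  # 2 2 1

# The result coincides with an ordinary term-frequency count.
print(bag == dict(Counter(D)))  # True
```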

The task of determining co-occurrence of terms in a window
(Song & Bruza 2003) can also be expressed in terms
of Selective Erasers. A co-occurrence measurement of terms
t_i and t_j within a window of width w can be performed in
a similar way: the number of times t_j occurs in the
vicinity of t_i, defined by a window of width w, in a document
D is given by N(t_i,t_j,w,D) = |E(t_j,0)[E(t_i,w)D]|.
First, a wide-window Selective Eraser E(t_i,w) is applied to
D; then a narrow-window eraser E(t_j,0) is applied; and finally
the tokens are counted in the resulting document.
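
The co-occurrence measurement N(t_i,t_j,w,D) = |E(t_j,0)[E(t_i,w)D]| can be sketched in Python as two successive erasures followed by a count. The function name `cooccurrence` and the helpers are hypothetical, and erased tokens are again assumed to be kept as `None` placeholders.

```python
from typing import List, Optional

Doc = List[Optional[str]]

def erase(doc: Doc, term: str, w: int) -> Doc:
    """Apply E(term, w); erased tokens become None (assumption)."""
    centres = [i for i, tok in enumerate(doc) if tok == term]
    keep = {j for c in centres for j in range(c - w, c + w + 1)}
    return [tok if i in keep else None for i, tok in enumerate(doc)]

def count(doc: Doc) -> int:
    """The counting operation |.|: number of tokens left unerased."""
    return sum(tok is not None for tok in doc)

def cooccurrence(doc: Doc, t_i: str, t_j: str, w: int) -> int:
    """N(t_i, t_j, w, D): occurrences of t_j inside windows of width w around t_i."""
    wide = erase(doc, t_i, w)     # first the wide-window eraser E(t_i, w)
    narrow = erase(wide, t_j, 0)  # then the narrow eraser E(t_j, 0)
    return count(narrow)

D = "to be or not to be that is the question".split()
print(cooccurrence(D, "be", "to", 1))  # 2
print(cooccurrence(D, "is", "to", 2))  # 0
print(cooccurrence(D, "is", "to", 3))  # 1
```

In this example, widening the window around "is" from 2 to 3 brings the second "to" into range, which is why the count changes from 0 to 1.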

Higher order erasers (w > 0) will capture semantic relations
between terms, and these will also be reflected in the order
relations. For example, for some documents a Selective Eraser
centred on the term "George" with some width w will hold
an order relation with the one centred on the term "Bush" with
width w - 1, because the two terms appear together:

\[ E(\mathrm{George}, w) \geq E(\mathrm{Bush}, w-1) \tag{6} \]

For other documents, the same would hold for "Kate" and
"Bush":

\[ E(\mathrm{Kate}, w) \geq E(\mathrm{Bush}, w-1) \tag{7} \]

These relations can be used to define different subsets of 
documents (clusters): we could define the class of documents
where (6) holds, and the class of documents where (7)
holds. While this example is trivial, when bigger windows
are involved, the representation can include more complex 
particularities in the use of the terms. In future work, we 
hope to explore the potential uses of this idea in a clustering 
scheme. 
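
A minimal Python sketch of how relations (6) and (7) could separate documents into classes is given below. The two example documents, the lowercase tokens, and the function names `keep_mask` and `holds` are invented for illustration, and the relation is tested per document by comparing which token positions each eraser preserves; this is only one possible way of operationalising the idea.

```python
from typing import Dict, List

def keep_mask(tokens: List[str], term: str, w: int) -> List[bool]:
    """True at every position that E(term, w) would preserve in this document."""
    centres = [i for i, tok in enumerate(tokens) if tok == term]
    return [any(abs(i - c) <= w for c in centres) for i in range(len(tokens))]

def holds(tokens: List[str], t1: str, w1: int, t2: str, w2: int) -> bool:
    """Check E(t1, w1) >= E(t2, w2) on one document.

    Note: if t2 never occurs in the document, the relation holds vacuously.
    """
    m1 = keep_mask(tokens, t1, w1)
    m2 = keep_mask(tokens, t2, w2)
    return all(a or not b for a, b in zip(m1, m2))

# Hypothetical example documents (not taken from the paper).
docs: Dict[str, List[str]] = {
    "d1": "president george bush spoke in washington today".split(),
    "d2": "singer kate bush released a new album".split(),
}

w = 2
classes = {
    "relation (6)": [d for d, toks in docs.items() if holds(toks, "george", w, "bush", w - 1)],
    "relation (7)": [d for d, toks in docs.items() if holds(toks, "kate", w, "bush", w - 1)],
}
print(classes)  # {'relation (6)': ['d1'], 'relation (7)': ['d2']}
```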

Probabilities 

Erasers can be seen as propositions about a certain word
(for example: term t_1 is in the neighbourhood of term t_2) that
can be fulfilled or not by any token in a document (like being
in the neighbourhood of an occurrence of a certain term).
As such, they can be given a truth value for every token in a
document, and such values are logically related for different
erasers in the ways explained above. But it is also natural
to assign them probabilities, and this can be done in a very
simple way by Gleason's theorem (Gleason 1957). For a
given state of affairs represented by ρ, a probability measure
can be defined for erasers in the following way:



\[ P(E) = \mathrm{Trace}(\Pi_E \rho) \tag{8} \]



where Π_E is the projector representing eraser E, and ρ is a
density operator representing the preparation of
the system: it can be a representation of a single document,
or a representation of a collection consisting of several or
all the documents. To assign a meaning to this probability,
it is important to note that it refers to any token in a
document: it is the probability that a token
picked at random from the document is left unerased
by the transformation (eraser) E. Beyond this frequentist
interpretation of the obtained probability, it is possible
to follow a Bayesian interpretation of quantum probability
(Caves, Fuchs, & Schack 2002) and define conditional
probabilities that reflect the logical structure inherent in
these transformations. For example, we can define:

• Equation (8) can be thought of as the probability of any
token in the system being left unerased by transformation E,
given a preparation of the system in state D represented
by density operator ρ_D:

\[ P(E \mid D) = \mathrm{Trace}(\Pi_E \rho_D) \tag{9} \]



• We can also define the probability of a token not being
erased by eraser E_2, given a preparation in document D
and a previous application of eraser E_1:

\[ P(E_2 \mid E_1 D) = \mathrm{Trace}\big(\Pi_{E_2} (\Pi_{E_1} \rho_D \Pi_{E_1})\big) \tag{10} \]

• It is even possible to define the probability of an
implication:

\[ P(E_1 \rightarrow E_2 \mid D) = \frac{\mathrm{Trace}\big(\Pi_{E_2} (\Pi_{E_1} \rho_D \Pi_{E_1})\big)}{\mathrm{Trace}(\Pi_{E_1} \rho_D)} \tag{11} \]

All these probabilities are computed from the representa- 
tions of the documents, collections, and erasers. What is 
their relation to lexical, experimental quantities? The re- 
lation is, indeed, simple. We can define, for lexical mea- 
surements in one document, a fraction that will behave as a 
probability: 

\[ F(ED) = \frac{|ED|}{|D|} \tag{12} \]

where \ED\ is the number of tokens in the document after 
applying E, and \D\ is the number of tokens in the initial 
document. Probabilities, as we defined them, can be simply 
equated to these fractions: 



\[ P(E \mid D) = F(ED) \tag{13} \]



Mathematical representations for erasers and documents can
be derived from the measured fractions F(ED), by choosing them
so as to reproduce these numbers, exactly or approximately,
with the traces of their products. A scheme similar to this
has been proposed by Mana (2003) for probabilistic data
analysis, but in a more general context.
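
One very simple representation that reproduces the fractions exactly for a single document can be sketched in Python. It is a toy model and an assumption of this sketch, not the construction proposed in the paper: each token position is taken as a basis vector, ρ_D is the maximally mixed state I/|D|, and Π_E is the diagonal projector with ones at the positions E preserves. Only the diagonals are stored, since every matrix involved is diagonal. Because diagonal projectors always commute, this toy model cannot express the order effects discussed earlier.

```python
from fractions import Fraction
from typing import List

def keep_mask(tokens: List[str], term: str, w: int) -> List[bool]:
    """True at every position that E(term, w) would preserve in this document."""
    centres = [i for i, tok in enumerate(tokens) if tok == term]
    return [any(abs(i - c) <= w for c in centres) for i in range(len(tokens))]

def projector_diag(tokens: List[str], term: str, w: int) -> List[Fraction]:
    """Diagonal of the toy projector for E(term, w): 1 if kept, 0 if erased."""
    return [Fraction(int(k)) for k in keep_mask(tokens, term, w)]

def density_diag(tokens: List[str]) -> List[Fraction]:
    """Diagonal of the toy density operator rho_D = I / |D|."""
    n = len(tokens)
    return [Fraction(1, n)] * n

def trace_of_product(*diags: List[Fraction]) -> Fraction:
    """Trace of a product of diagonal matrices given by their diagonals."""
    total = Fraction(0)
    for entries in zip(*diags):
        prod = Fraction(1)
        for x in entries:
            prod *= x
        total += prod
    return total

D = "to be or not to be that is the question".split()
rho = density_diag(D)
P1 = projector_diag(D, "is", 3)  # toy projector for E_1 = E(is,3)
P2 = projector_diag(D, "be", 1)  # toy projector for E_2 = E(be,1)

p_E1 = trace_of_product(P1, rho)                   # equation (9)
F_E1 = Fraction(sum(keep_mask(D, "is", 3)), len(D))  # equation (12)
p_seq = trace_of_product(P2, P1, rho, P1)          # equation (10)
p_impl = p_seq / p_E1                              # equation (11)

print(p_E1, F_E1, p_seq, p_impl)  # 3/5 3/5 3/10 1/2
```

In this model equation (13) holds by construction, since the trace simply counts preserved positions and divides by |D|.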



Conclusions 

In this paper we have proposed an approach for the represen- 
tation of documents based on the analogy between lexical 
measurements on documents and measurements on physical 
systems. This approach allows us to represent not only lexi- 
cal features that are used in traditional methods (i.e. bag-of- 
words), but also to include more detailed characteristics of 
the use of words, like co-occurrence. However, the approach 
extends beyond such standard interpretations, and provides 
order relations between propositions about the relative posi- 
tions of words. This provides a novel way in which to in- 
terpret lexical relations that would not be otherwise possible 
without the application of this quantum analogy. 

In the future, we hope to develop practical IR applications 
based on Selective Erasers. To this aim, we will explore 
two main directions: (1) using order relations of Selective 
Erasers as a way to define clusters of documents, and (2) 
formulating an indexing scheme based on a density operator 
representation of documents, that allows the use of the rich 
mathematical structure of Hilbert Spaces to encode semantic 
information about documents. 